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Abstract 

Purely functional, embedded array programs are a good match for 
SIMD hardware, such as GPUs. However, the naive compilation 
of such programs quickly leads to both code explosion and an 
excessive use of intermediate data structures. The resulting slow- 
down is not acceptable on target hardware that is usually chosen to 
achieve high performance. 

In this paper, we discuss two optimisation techniques, sharing 
recovery and array fusion, that tackle code explosion and elimi- 
nate superfluous intermediate structures. Both techniques are well 
known from other contexts, but they present unique challenges 
for an embedded language compiled for execution on a GPU. We 
present novel methods for implementing sharing recovery and array 
fusion, and demonstrate their effectiveness on a set of benchmarks. 

Categories and Subject Descriptors D.3.2 [Programming Lan- 
guages]: Language Classification — Applicative (functional) lan- 
guages; Concurrent, distributed, and parallel languages 

Keywords Arrays; Data parallelism; Embedded language; Dy- 
namic compilation; GPGPU; Haskell; Sharing recovery; Array fu- 
sion 

1. Introduction 

Recent work on stream fusion 1121 . the vector package 1231 . and 
the parallel array library Repa 1171 1191 1201 has demonstrated that 
(1) the performance of purely functional array code in Haskell 
can be competitive with that of imperative programs and that (2) 
purely functional array code lends itself to an efficient parallel 
implementation on control-parallel multicore CPUs. 

So far, the use of purely functional languages for programming 
data parallel SIMD hardware such as GPUs (Graphical Processing 
Units) has been less successful. Vertigo 1 13] was an early Haskell 
EDSL producing DirectX 9 shader code, though no runtime perfor- 
mance figures were reported. Nikola [22| produces code competi- 
tive with CUDA, but without supporting generative functions like 
replicate where the result size is not statically fixed. Obsidian 
1101 is additionally restricted to only processing arrays of a fixed, 
implementation dependent size. Additionally, both Nikola and Ob- 
sidian can only generate single GPU kernels at a time, so that in 



Permission to make digital or hard copies of all or part of this work for personal or 
classroom use is granted without fee provided that copies are not made or distributed 
for profit or commercial advantage and that copies bear this notice and the full citation 
on the first page. Copyrights for components of this work owned by others than the 
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or 
republish, to post on servers or to redistribute to lists, requires prior specific permission 
and/or a fee. Request permissions from permissions@acm.org. 
ICFP '13, September 25-27, 201 3, Boston, MA, USA. 
Copyright is held by the owner/author(s). Publication rights licensed to ACM. 
ACM 978-1-4503-2326-0/13/09. . . $15.00. 
http://dx.doi.org/10.1145/2500365.2500595 



programs consisting of multiple kernels the intermediate data struc- 
tures must be shuffled back and forth across the CPU-GPU bus. 

We recently presented Accelerate, an EDSL and skeleton-based 
code generator targeting the CUDA GPU development environ- 
ment [8). In the present paper, we present novel methods for op- 
timising the code using sharing recovery and array fusion. 

Sharing recovery for embedded languages recovers the sharing 
of let-bound expressions that would otherwise be lost due to the 
embedding. Without sharing recovery, the value of a let-bound 
expression is recomputed for every use of the bound variable. 
In contrast to prior work [ 14] that decomposes expression trees 
into graphs and fails to be type preserving, our novel algorithm 
preserves both the tree structure and typing of a deeply embedded 
language. This enables our runtime compiler to be similarly type 
preserving and simplifies the backend by operating on a tree-based 
intermediate language. 

Array fusion eliminates the intermediate values and additional 
GPU kernels that would otherwise be needed when successive 
bulk operators are applied to an array. Existing methods such as 
foldr /build fusion |15 | and stream fusion 1121 are not applica- 
ble to our setting as they produce tail-recursive loops, rather than 
the GPU kernels we need for Accelerate. The NDP2GPU system 
of (4) does produce fused GPU kernels, but is limited to simple 
map/map fusion. We present a fusion method partly inspired by 
Repa's delayed arrays 1171 that fuses more general producers and 
consumers, while retaining the combinator based program repre- 
sentation that is essential for GPU code generation using skeletons. 

With these techniques, we provide a high-level programming 
model that supports shape-polymorphic maps, generators, reduc- 
tions, permutation and stencil-based operations, while maintaining 
performance that often approaches hand-written CUDA code. 

In summary, we make the following contributions: 

• We introduce a novel sharing recovery algorithm for type-safe 
ASTs, preserving the tree structure (Section[3j. 

• We introduce a novel approach to array fusion for embedded 
array languages (Section|4|. 

• We present benchmarks for several applications including 
Black-Scholes options pricing, Canny edge detection, and 
a fluid flow simulation, including a comparison with hand- 
optimised CPU and GPU code (Section|5j. 

This paper builds on our previous work on a skeleton-based CUDA 
code generator for Accelerate |8|. Although we motivate and eval- 
uate our novel approaches to sharing recovery and array fusion in 
this context, our contribution is not limited to Accelerate. Specifi- 
cally, our sharing recovery applies to any embedded language based 
on the typed lambda calculus and our array fusion applies to any 
dynamic compiler targeting bulk-parallel SIMD hardware. 

We discuss related work in detail in Section [6] The source 
code for Accelerate including our benchmark code is available 
from https : / /github . com/ AccelerateHS/ accelerate 



2. Optimising embedded array code 

Accelerate is a domain specific language, consisting of a carefully 
selected set of array operators that we can produce efficient GPU 
code for. Accelerate is embedded in Haskell, meaning that we write 
Accelerate programs using Haskell syntax, but do not compile 
arbitrary Haskell programs to GPU machine code. Accelerate is 
stratified into array computations, wrapped in a type constructor 
Acc, and scalar expressions, represented by terms of type Exp t, 
where t is the type of value produced by the expression. This 
stratification is necessary due to the hardware architecture of GPUs 
and their reliance on SIMD parallelism, as we discussed in our 
previous work (8). 

2.1 Too many kernels 

For example, to compute a dot product, we use: 

dotp : : Vector Float -> Vector Float 

-> Acc (Scalar Float) 
dotp xs ys = let xs' = use xs 
ys' = use ys 
in fold (+) 0 (zipWith (*) xs' ys ' ) 

The function dotp consumes two one-dimensional arrays (Vector) 
of floating-point values and produces a single (Scalar) floating- 
point result. From the return type Acc (Scalar Float), we see 
that dotp is an embedded Accelerate computation, rather than 
vanilla Haskell code. 

The functions zipWith and fold are defined in our library 
Data. Array. Accelerate, and have massively parallel GPU im- 
plementations with the following type signatures: 

zipWith :: (Shape sh, Elt a, Elt b, Elt c) 
=> (Exp a -> Exp b -> Exp c) 
-> Acc (Array sh a) 
-> Acc (Array sh b) 
-> Acc (Array sh c) 

fold : : (Shape ix, Elt a) 

=> (Exp a -> Exp a -> Exp a) 
-> Exp a 

-> Acc (Array (ix:.Int) a) 
-> Acc (Array ix a) 

The type classes Elt and Shape indicate that a type is admissible 
as an array element and array shape, respectively. Array shapes 
are denoted by type-level lists formed from Z and ( : . ) — for 
example, Z : . Int : . Int is the shape of a two-dimensional array 
(see I8l ll7l for details). The type signatures of zipWith and fold 
clearly show the stratification into scalar computations using the 
Exp type constructor, and array computations wrapped in Acc. 

The arguments to dotp are of plain Haskell type Vector Float. 
To make these arguments available to the Accelerate computation 
they must be embedded with the use function: 

use : : Arrays arrays => arrays -> Acc arrays 

This function is overloaded so that it can accept entire tuples of 
arrays. Operationally, use copies array data from main memory 
to GPU memory, in preparation for processing by the GPU. 

The above Haskell version of the GPU-accelerated dot product 
is certainly more compact than the corresponding CUDA C code. 
However, when compiled with the skeleton-based approach we 
described in previous work |8|, it is also significantly slower. The 
CUDA C version executes in about half the time (see Table[2](. 

The slow-down is due to Accelerate generating one GPU ker- 
nel function for zipWith and another one for fold. In contrast, the 
CUDA C version only uses a single kernel. The use of two sepa- 
rate kernels requires an intermediate array to be constructed, and in 



a memory bound benchmark, such as dotp, this doubles the run- 
time. To eliminate this intermediate array we need to fuse the ad- 
jacent aggregate array computations. Although there is an existing 
body of work on array fusion, no existing method adequately deals 
with massively parallel GPU kernels. We present a suitable fusion 
framework as the first major contribution of this paper. 

2.2 Too little sharing 

As a second example, consider the pricing of European-style op- 
tions using the Black-Scholes formula. The Accelerate program is 
in Figure [T] Given a vector of triples of underlying stock price, 
strike price, and time to maturity (in years), the Black-Scholes for- 
mula computes the price of a call and a put option. The function 
callput evaluates the Black-Scholes formula for a single triple, 
and blackscholes maps it over a vector of triples, such that all 
individual applications of the formula are executed in parallel. 

If we compare the performance of the GPU code generated by 
Accelerate with that of an equivalent implementation in CUDA 
C, the Accelerate version is almost twenty times slower. As 
blackscholes includes only one aggregate array computation, 
the problem can't be a lack of fusion. Instead, as we noted previ- 
ously 1 8 1, it is due to the embedding of Accelerate in Haskell. 

The function callput includes a significant amount of sharing: 
the helper functions end ' , and hence also horner, are used twice 
— for dl and d2 — and its argument d is used multiple times in 
the body. Our embedded implementation of Accelerate reifies the 
abstract syntax of the (deeply) embedded language in Haskell. 
Consequently, each occurrence of a let-bound variable in the source 
program creates a separate unfolding of the bound expression in the 
compiled code. 

This is a well known problem that has been solved elegantly by 
the sharing recovery algorithm of Gill |14|, which makes use of 
stable names. Unfortunately, Gill's original approach (1) reifies the 
abstract syntax in graph form and (2) it assumes an untyped syntax 
representation. In contrast, Accelerate is based on a typed tree 
representation using GADTs and type families in conjunction with 
type-preserving compilation in most phases. In other words, we use 
Haskell's type checker to statically ensure many core properties 
of our Accelerate compiler. The fact that the compiler for the 
embedded language is type preserving, means that many bugs in 
the Accelerate compiler itself are caught by the Haskell compiler 
during development. This in turn reduces the number of Accelerate 
compiler bugs that the end-user might be exposed to. 

As we require typed trees, where sharing is represented by let 
bindings rather than untyped graphs, we cannot directly use Gill's 
approach to sharing recovery. Instead, we have developed a novel 
sharing recovery algorithm, which like Gill's, uses stable names, 
but unlike Gill's, operates on typed abstract syntax. Our algorithm 
produces a typed abstract syntax tree, and we are able to recover 
exactly those let bindings used in the source program. This is the 
second major contribution of this paper. 

2.3 Summary 

In summary, a straightforward skeleton-based implementation of 
an embedded GPU language suffers from two major inefficiencies: 
lack of sharing and lack of fusion. Both sharing recovery in em- 
bedded languages, and array fusion in functional languages have 
received significant prior attention. However, we have found that 
none of the existing techniques are adequate for a type-preserving 
embedded language compiler targeting massively parallel SIMD 
hardware, such as GPUs. 

3. Sharing recovery 

Gill 1141 proposed to use stable names |27| to recover the sharing 
of source terms in a deeply embedded language. The stable names 



blackscholes :: Vector (Float, Float, Float) 

-> Acc (Vector (Float, Float)) 
blackscholes = map callput . use 
where 

callput x = 



riskfree, volatility : : Float 
riskfree =0.02 
volatility =0.30 



horner 



Num a => [a] -> a -> a 



(price , 


strike, years) = unlift x 




horner coeff x = 


x * foldrl madd coeff 


r 


= constant riskfree 




where 




V 


= constant volatility 




madd a b = a 


+ x*b 


v_sqrtT 


= v * sqrt years 








dl 


= (log (price / strike) + 




end' : : Floating 


a => a -> a 




(r + 0.5 * v * v) * years) 


/ v_sqrtT 


end' d = 




d2 


= dl - v_sqrtT 




let poly = 


horner coeff 


end d 


= let c = end' d in d >* 0 ? 


(1.0 - c, c) 


coeff = 


[0.31938153, -0.356563782, 


cndDl 


= end dl 






1.781477937, -1.821255978, 


cndD2 


= end d2 






1.330274429] 


x_expRT 


= strike * exp (-r * years) 




rsqrt2pi = 
k 


0 . 39894228040143267793994605993438 
1.0 / (1.0 + 0.2316419 * abs d) 



lift ( price * cndDl - x_expRT * cndD2 in 

, x_expRT * (1.0 - cndD2) - price * (1.0 - cndDl)) rsqrt2pi * exp (-0.5*d*d) * poly k 



Figure 1. Black-Scholes option pricing in Accelerate 



of two Haskell terms are equal only when the terms are represented 
by the same heap structure in memory. Likewise, when the abstract 
syntax trees (ASTs) of two terms of an embedded language have 
the same stable name, we know that they represent the same value. 
In this case we should combine them into one shared AST node. As 
the stable name of an expression is an intensional property, it can 
only be determined in Haskell's 10 monad. 

Accelerate's runtime compiler preserves source types through- 
out most of the compilation process |8|. In particular, it converts 
the source representation based on higher-order abstract syntax 
(HOAS) to a type-safe internal representation based on de Bruijn 
indices. Although developed independently, this conversion is like 
the unembedding of Atkey et al. (TJ. Unembedding and sharing re- 
covery necessarily need to go hand in hand for the following rea- 
sons. Sharing recovery must be performed on the source represen- 
tation; otherwise, sharing will already have been lost. Nevertheless, 
we cannot perform sharing recovery on higher-order syntax as we 
need to traverse below the lambda abstractions used in the higher- 
order syntax representation. Hence, both need to go hand in hand. 

As shown by Atkey et al. |1], the conversion of HOAS in 
Haskell can be implemented in a type-preserving manner with the 
exception of one untyped, but dynamically checked environment 
look up, using Haskell's Typeable. To preserve maximum type 
safety, we do not want any further operations that are not type pre- 
serving when adding sharing recovery. Hence, we cannot use Gill's 
algorithm. The variant of Gill's algorithm used in Syntactic (2) 
does not apply either: it (1) also generates a graph and (2) discards 
static type information in the process. In contrast, our novel algo- 
rithm performs simultaneous sharing recovery and conversion from 
HOAS to a typed de Bruijn representation, where the result of the 
conversion is a tree (not a graph) with sharing represented by let 
bindings. Moreover, it is a type-preserving conversion whose only 
dynamically checked operation is the same environment lookup 
deemed unavoidable by Atkey et al. [TJ. 

In the following, we describe our algorithm using plain typed 
lambda terms. This avoids much of the clutter that we would incur 
by basing the discussion on the full Accelerate language. In addi- 
tion to its implementation in Accelerate, a working Haskell imple- 
mentation of our algorithm for the plain lambda calculus can be 
found at https : //github . com/mchakravarty/hoas-conv 



3.1 Our approach to sharing recovery 

Before we formalise our sharing recovery algorithm in the follow- 
ing subsections, we shall illustrate the main idea. Consider the fol- 
lowing source term: 

let inc = (+) 1 
in let nine = let three = inc 2 
in 

(*) three three 

in 

(-) (inc nine) nine 

This term's abstract syntax DAG is the leftmost diagram in Fig- 
ure]^] It uses @ nodes to represent applications; as in this grammar: 



C 



C 
x 

Xx. T 
7i @ T 2 
(constants) 



T T where 

C T 
x T 

\x T1 . T T2 
TP 



T2 @ r 2 T1 



rpT2 



The left definition does not track types, whereas the right one does. 
We implement typed ASTs in Haskell with GADTs and work with 
typed representations henceforth. Typed HOAS conversion with 
sharing recover proceeds in three stages: 

1 . Prune shared subterms: A depth first traversal over the AST an- 
notates each node with its unique stable name, where we build 
an occurrence map of how many times we've already visited 
each node. If we encounter a previously visited node, it repre- 
sents a shared subterm, and we replace it by a placeholder con- 
taining its stable name. The second diagram in Figure [2] shows 
the outcome of this stage. Each node is labeled by a number 
that represents its stable name, and the dotted edges indicate 
where we encountered a previously visited, shared node. The 
placeholders are indicated by underlined stable names. 

2. Float shared terms: All shared subterms float upwards in the 
tree to just above the lowest node that dominates all edges to 
the original position of that shared subterm — see the third 
diagram in Figure|2] Floated subterms are referenced by circled 
stable names located above the node that they floated to. If a 
node collects more than one shared subterm, the subterm whose 
origin is deeper in the original term goes on top — here, 9 on top 
of 5. Nested sharing leads to subterms floating up inside other 
floated subterms — here, 8 stays inside the subterm rooted in 5. 




@ 1 

/ \ 



(-) ! 



/ \ 



••• \ 



/ . 



8 



@ 9 
/ \ 




Figure 2. Recovering sharing in an example term 



3. Binder introduction: Each floated subterm gets let-bound right 
above the node it floated to (rightmost diagram in Figure [2j. 
While we use explicit, bound names in the figure, we introduce 
de Bruijn indices at the same time as introducing the lets. 

3.2 Prune shared subterms 

First, we identify and prune shared subtrees, producing a pruned 
tree of the following form (second diagram in Figure|2]l: 

°T T where 



£ : 


Orj-iT 


-- binder conversion level 


v T : 


Orj-iT 


~ pruned subtree (name) 


C T : 


Orj-iT 




\£.°T T2 : 


, O/xiTi— J-T2 




° T T1^T 2 @ a T T! . 


, OrpT2 





A stable name (here, of type Name) associates a unique name 
with each unique term node, so that two terms with the same stable 
name are identical, and are represented by the same data structure in 
memory. Here, we denote the stable name of a term as a superscript 
during pattern matching — e.g., \ v is a constant with stable name 
v, just as in the second and third diagram in Figure[2] 

An occurrence map, Q :: Name h-> Int, is a finite map that 
determines the number of occurrences of a Name that we encoun- 
tered during a traversal. The expression Qv yields the number of 
occurrences of the name v, and we have v G SI = (Q.v > 0). To 
add an occurrence to fi, we write vr> Q,. We will see in the next sub- 
section that we cannot simplify Q to be merely a set of occurring 
names. We need the actual occurrence count to determine where 
shared subterms should be let-bound. 

The identification and pruning of shared subtrees is formalised 
by the following function operating on closed terms from T T : 

prune :: Level — > (Name >-¥ Int) — > T T — > ((Name M> Int),°T T ) 
prune £ Q e v \ v G Q = (v E> Q, v) 
prune £ Q e" \ otherwise = enter (v r> Q) e 
where 



enter fi c 
enter Q (Xx.e) 



enter Q (ei @ e?) — 



XL 



(Q, c) 
let 

(Q\ 

in 

(Q', 
let 

{fh, 
{fa, 

in 

(Q 2 , e{ @ ei) 



prune (£ + 1) Q (\l/x\e) 



prune I Q e 1 
prune £ Qi ei 



The first equation of prune covers the case of a term's repeated 
occurrence. In that case, we prune sharing by replacing the term e" 
by a tag v_ containing its stable name — these are the dotted lines 
in the second diagram in Figure [2] 

To interleave sharing recovery with the conversion from HOAS 
to typed de Bruijn indices, prune tracks the nesting Level of 
lambdas. Moreover, the lambda case of enter replaces the HOAS 
binder x by the level £ at the binding and usage sites. 

Why don't we separate computing occurrences from tree prun- 
ing? When computing occurrences, we must not traverse shared 
subtrees multiple times, so we can as well prune at the same time. 
Moreover, in the first line of prune, we cannot simply return e in- 
stead of v — e is of the wrong form as it has type T and not °T\ 

As far as type-preservation is concerned, we do lose information 
due to replacing variables by levels £. This is the inevitable loss 
described by Atkey et al. [ 1 1, which we make up for by a dynamic 
check in an environment lookup, as already discussed. 

3.3 Float shared subterms 

Second, we float all shared subtrees out to where they should be 
let-bound, represented by (see third diagram in Figure[2]( 



where 

u T :: 

C T 



IT* 

r 



A term in comprises a sequence of floated-out subterms labelled 
by their stable name as well as a body term from *T from which 
the floated subterms where extracted. Moreover, the levels £ that 
replaced lambda binders in °T get replaced by the stable name of 
their term node. This simplifies a uniform introduction of de Bruijn 
indices for let and lam bda bound variables. 

We write v : for a possibly empty sequence of items: 
v\ : Ti, ...,!/„ : 'T n , where • denotes an empty sequence. 

The floating function float maintains an auxiliary structure of 
floating terms and levels, defined as follows: 



r 



t 



r 



These are floated subtrees named v of which we have collected i 
occurrences. The occurrence count indicates where a shared sub- 
term gets let bound: namely at the node where it matches Qv. 
This is why prune needed to collect the number of occurrences 
in Q. When the occurrence count matches Qv, we call the floated 



term saturated. The following function determines saturated floated 
terms, which ought to be let bound: 



bind :: (Name i-> 
bind Q • 

bind Q (y : e, f) 
feind i? (y : _, -T) 



Jrrf) - 



r 



— • 

z = v : e, 6md .f? J 1 
= bind J? F 



Note that J 1 does not keep track of the type r of a floated term ^T T ; 
hence, floated terms from bind come in an existential package. This 
does not introduce additional loss of type safety as we already lost 

the type of lambda bound variables in v : £. It merely means that let 
bound, just like lambda bound, variables require the dynamically 
checked environment look up we already discussed. 

When floating the first occurrence of a shared tree (not pruned 

by prune), we use v : ^T T . When floating subsequent occurrences 

(which were pruned), we use v : •. Finally, when floating a level, to 

replace it by a stable name, we use v : £. 

We define a partial ordering on floated terms: v\ : x < vi \ y 
iff the direct path from v\ to the root of the AST is shorter than 
that of V2 . We keep sequences of floated terms in descending order 
— so that the deepest subterm comes first. We write Fi W I2 to 
merge two sequences of floated terms. Merging respects the partial 
order, and it combines floated trees with the same stable name by 
adding their occurrence counts. To combine the first occurrence and 
a subsequent occurrence of a shared tree, we preserve the term of 
the first occurrence. We write r \ V to delete elements of r that 
are tagged with a name that appears in the sequence v. 

We can now formalise the floating process as follows: 



(r,^T T 



float :: (Name h-> Int) -> °T T 

float Q t = (v : I, u) 

float Q v_ — (v : u) 
float fi e v — let 

(r, e') — descend e 

Vh : eh — bind Q r 

d — v b : e b >- e' 

in 

if Qv —— 1 then 

(r\Vb, d) 

else 

(r \ T% W {v : d}, v) 

where 

descend :: °T T -> (T, l T T ) 
descend c — (•, c) 

descend (X£.e) — let 

(T, e') = float Q e 

in 

if 3j/ i. (v' \l) e Tthen 
(r\K>, Xi/.e') 

else 

(T, A_.e') 

descend (e% S &i) = let 

(jTi, e[) — float (2 e\ 
(r"2, e<2) = float Q e2 

in 

(ritar 2; ei@e^ 

The first two cases of float ensure that the levels of lambda bound 
variables and the names of pruned shared subterms are floated 
regardless of how often they occur. In contrast, the third equation 
floats a term with name v only if it is shared; i.e., Qv is not 1. If it 
is shared, it is also pruned; i.e., replaced by its name y_ — just as in 
the third diagram of Figure [2] 



Regardless of whether a term gets floated, all saturated floated 
terms, v b : e b , must prefix the result, e', and be removed from r. 

When descending into a term, the only interesting case is for 
lambdas. For a lambda at level £, we look for a floated level of the 
form v' : £. If that is available, v' replaces £ as a binder and we 
remove v' : £ from F. However, if v 1 : £ is not in F, the binder 
introduced by the lambda doesn't get used in e. In this case, we 
pick an arbitrary new name; here symbolised by an underscore 

3.4 Binder introduction 

Thirdly, we introduce typed de Bruijn indices to represent lambda 
and let binding structure (rightmost diagram in Figure[2](: 



lv T T where 

en' 



let 



_ 1 , env)rj-i T 2 

"Tl 1 i 



1 2 



envrpr 
envrpr 
envrpTi - 
envrpT2 



UrpT 



With this type of terms, e :: env T T means that e is a term repre- 
senting a computation producing a value of type r under the type 
environment env. Type environments are nested pair types, possi- 
bly terminated by a unit type (). For example, (((), n), to) is a 
type environment, where de Bruijn index 0 represents a variable of 
type ro and de Bruijn index 1 represents a variable of type r\. 

We abbreviate let e\ in • ■ ■ let e„ in e b as let einej. 
Both A and let use de Bruijn indices 1 instead of introducing 
explicit binders. 

To replace the names of pruned subtrees and of lambda bound 
variables by de Bruijn indices, we need to construct a suitable 
type environment as well as an association of environment entries, 
their de Bruijn indices, and the stable names that they replace. We 
maintain the type environment with associated de Bruijn indices in 
the following environment layout structure: 



™4™ where 



o 

env A env' 



env^Q 

env \ ( env , t) 



Together with a layout, we use a sequence of names V of the same 
size as the layout, where corresponding entries represent the same 
variable. As this association between typed layout and untyped 
sequence of names is not validated by types, the lookup function 
lyt I i getting the ith index of layout lyt makes use of a dynamic 
type check. It's signature is (|) :: N env A env ' ->■ e "V. 
Now we can introduces de Bruijn indices to body expressions: 

body :: °™A em -> V -> l T T -> env T T 
body lyt (u pfi , • • • , v p%n ) v_ 

v == v p ,i = lyt I 1 
body lyt~P~p~c = c 

body lyt T>p~ (\v.e) = \(binders lyt + (v,W^) e) 

body lytVp(e\<&e2)= (binders lyt Vp e\) 0 (binders lyt Vp e-z) 

The first equation performs a lookup in the environment layout 
at the same index where the stable name v occurs in the name 
environment V. The lookup is the same for lambda and let bound 
variables. It is the only place where we need a dynamic type check 
and that is already needed for lambda bound variables alone. 

In the case of a lambda, we add a new binder by extending 
the layout, denoted lyt + , with a new zeroth de Bruijn index and 
shifting all others one up. Keeping the name environment in sync, 
we add the stable name v, which used as a binder. 

In the same vein, we bind n floated terms v : e with let bind- 
ings in body expression e b , by extending the type environment n 
times (map applies a function to each element of a sequence): 



(Before fusion) 




(After producer/producer fusion) 




(After consumer/producer fusion) 




Figure 3. Produce/producer and consumer/producer fusion 

binders :: <^A env -> 17 -> t T T -> em T T 
binders lyt v^(y : e y et,) = 

let map (binders lyt Vp) ~e in fcody (y,~Vp) eh 

where n = length (v : e) 

We tie the three stages together to convert from HOAS with sharing 
recovery producing let bindings and typed de Bruijn indices: 

hoasShanng : : T T — ^ °T T 
hoasSharing e — let 

(J?, e') = prune 0 • e 

(•, e") = float Q e' 

in 

binders o • e" 

4. Array fusion 

Fusion in a massively data-parallel, embedded language for GPUs, 
such as Accelerate, requires a few uncommon considerations. 

Parallelism. While fusing parallel collective operations, we must 
be careful not to lose information essential to parallel execution. 
For example, foldr /build fusion 1151 is not applicable, because 
it produces sequential tail-recursive loops rather than massively 
parallel GPU kernels. Similarly, the split/ join approach used 
in Data Parallel Haskell (DPH) 1161 is not helpful, although fused 
operations are split into sequential and parallel subcomputations, as 
they assume an explicit parallel scheduler, which in DPH is written 
directly in Haskell. Accelerate compiles massively parallel array 
combinators to CUDA code via template skeleton instantiation, so 
any fusion system must preserve the combinator representation of 
the intermediate code. 

Sharing. Existing fusion transforms rely on inlining to move pro- 
ducer and consumer expressions next to each other, which allows 
producer/consumer pairs to be detected. However, when let-bound 



variables are used multiple times in the body of an expression, un- 
restrained inlining can lead to duplication of work. Compilers such 
as GHC, handle this situation by only inlining the definitions of let- 
bound variables that have a single use site, or by relying on some 
heuristic about the size of the resulting code to decide what to inline 
1 26 1 . However, in typical Accelerate programs, each array is used at 
least twice: once to access the shape information and once to access 
the array data; so, we must handle at least this case separately. 

Filtering. General array fusion transforms must deal with filter- 
like operations, for which the size of the result structure depends on 
the value of the input structure, as well as its size. Accelerate does 
not encode filtering as a primitive operation, so we do not need to 
consider it further[j 

Fusion at run-time. As the Accelerate language is embedded in 
Haskell, compilation of the Accelerate program happens at Haskell 
runtime rather than when compiling the Haskell program. For this 
reason, optimisations applied to an Accelerate program contribute 
to its overall runtime, so we must be mindful of the cost of analysis 
and code transformation. On the flip-side, runtime optimisations 
can make use of information that is only available at runtime. 

Fusion on typed de Brujin terms. We fuse Accelerate programs 
by rewriting typed de Bruijn terms in a type preserving manner. 
However, maintaining type information adds complexity to the def- 
initions and rules, which amounts to a partial proof of correctness 
checked by the type checker, but is not particularly exciting for the 
present exposition. Hence, in this section, we elide the steps neces- 
sary to maintain type information during fusion. 

4.1 The Main Idea 

All collective operations in Accelerate are array-to- array transfor- 
mations. Reductions, such as fold, which reduce an array to a sin- 
gle element, yield a singleton array rather than a scalar expression. 
Hence, we can partition array operations into two categories: 

1 . Operations where each element of the result array depends on at 
most one element of each input array. Multiple elements of the 
output array may depend on a single input array element, but 
all output elements can be computed independently. We refer to 
these operations as producers. 

2. Operations where each element of the result array depends on 
multiple elements of the input array. We call these functions 
consumers, in spite of the fact that they also produce an array. 

Table[T]summarises the collective array operations that we support. 
In a parallel context, producers are more pleasant to deal with be- 
cause independent element-wise operations have an obvious map- 
ping to the GPU. Consumers are a different story, as we need to 
know exactly how the computations depend on each other to im- 
plement them efficiently. For example, a parallel fold (with an asso- 
ciative operator) can be implemented efficiently as a tree reduction, 
but a parallel scan requires two separate phases t9l l31l . Unfortu- 
nately, this sort of information is obfuscated by most fusion tech- 
niques. To support the different properties of producers and con- 
sumers, our fusion transform is split into two distinct phases: 

• Producer/producer: fuse sequences of producers into a single 
producer. This is implemented as a source-to-source transfor- 
mation on the AST. 

• Consumer/producer: fuse producers followed by a consumer 
into the consumer. This happens during code generation, where 
we specialise the consumer skeleton with the producer code. 



filter is easily implemented as a combination of the core primitives, and 
is provided as part of the library. 



Producers 

: : (Exp a -> Exp b) -> Acc (Array sh a) -> Acc (Array sh b) 

: : (Exp a -> Exp b -> Exp c) -> Acc (Array sh a) -> Acc (Array sh b) 

-> Acc (Array sh c) 

backpermute : : Exp sh' -> (Exp sh' -> Exp sh) -> Acc (Array sh a) 

> Acc (Array sh' e) 
: Slice slix => Exp slix 

> Acc (Array (SliceShape slix) e) 

> Acc (Array (FullShape slix) e) 
: Slice slix 

=> Acc (Array (FullShape slix) e) -> Exp slix 

-> Acc (Array (SliceShape slix) e) 

generate : : Exp sh -> (Exp sh -> Exp a) -> Acc (Array sh a) 



map 

zipWith 



replicate 



slice 



map a function over an array 
apply funciton to. . . 
... a pair of arrays 
backwards permutation 

extend array across. . . 
. . . new dimensions 

remove existing dimensions 
array from index mapping 



Consumers 

fold : : (Exp a -> Exp a -> Exp a) -> Exp a -> Acc (Array (sh: .Int) a) 

-> Acc (Array sh a) 
scan-(l,r} : : (Exp a -> Exp a -> Exp a) -> Exp a -> Acc (Vector a) 

-> Acc (Vector a) 
permute : : (Exp a -> Exp a -> Exp a) -> Acc (Array sh' a) 

-> (Exp sh -> Exp sh') -> Acc (Array sh a) -> Acc (Array sh' a) 
stencil : : Stencil sh a stencil => (stencil -> Exp b) -> Boundary a 

-> Acc (Array sh a) -> Acc (Array sh b) 



tree reduction along. . . 
. . . innermost dimension 
left-to-right or right-to-left 
. . . vector pre-scan 
forward permutation 

map a function with local. . 
. . . neighbourhood context 



Table 1. Summary of Accelerate's core collective array operations, omitting Shape and Elt class constraints for brevity. In addition, there 
are other flavours of folds and scans as well as segmented versions of these. 



Separating fusion into these two phases reduces the complexity of 
the task, though there is also a drawback: as all collective opera- 
tions in Accelerate output arrays, we might wish to use the output 
of a consumer as an input to a producer as in map g . fold f z. 
Here, the map operation could be fused into the fold by apply- 
ing the function g to each element produced by the reduction be- 
fore storing the final result in memory. This is useful, as Accelerate 
works on multidimensional arrays, so the result of a fold can be 
a large array rather than just a singleton array. Our approach cur- 
rently does not fuse producer/consumer pairs, only consumer/pro- 
ducer and producer/producer combinations. 

Figure [3] illustrates how fusion affects the AST: blue boxes p\ 
to p7 represent producers, where ps is a producer like zipWith 
with two input arrays. The consumers are ci and C2. Firstly, we 
fuse all producers, with the exception of pi whose result is used by 
both ci and p2. Next, we plug the fused producers into consumers 
where possible. Again, p\ is left as is. It would be straightforward 
to change our implementation such that it would fuse p\ into both 
P2 and ci. This would duplicate the work of pi into both p2 and 
ci , which, despite reducing memory traffic, is not always advanta- 
geous. Our current implementation is conservative and never dupli- 
cates work; we plan to change this in future work as the restricted 
nature of Accelerate means that we can compute accurate cost es- 
timates and make an informed decision. In contrast, producer/con- 
sumer fusion of ci into P4 would require fundamental changes. 

4.2 Producer/producer fusion for parallel arrays 

The basic idea behind the representation of producer arrays in 
Accelerate is well known: simply represent an array by its shape 
and a function mapping indices to their corresponding values. We 
previously used it successfully to optimise purely functional array 
programs in Repa 1 17], but it was also used by others 1 1 1|. 

However, there are at least two reasons why it is not always 
beneficial to represent all array terms uniformly as functions. One is 
sharing: we must be able to represent some terms as manifest arrays 
so that a delayed-by-default representation can not lead to arbitrary 
loss of sharing. This is a well known problem in Repa. The other 



consideration is efficiency: since we are targeting an architecture 
designed for performance, we prefer more specific operations. An 
opaque indexing function is too general, conveying no information 
about the pattern in which the underlying array is accessed, and 
hence no opportunities for optimisation. We shall return to this 
point in Section[5] but already include a form of structured traversal 
over an array (Step) in the following definition: 

data DelayedAcc a where 
Done : : Acc a 

-> DelayedAcc a 

Yield : : (Shape sh, Elt e) 
=> Exp sh 
-> Fun (sh -> e) 
-> DelayedAcc (Array sh e) 

Step :: (Shape sh, Shape sh' , Elt e, Elt e') 
=> Exp sh' 
-> Fun (sh' -> sh) 
-> Fun (e -> e') 
-> Idx (Array sh e) 
-> DelayedAcc (Array sh' e') 

We have three constructors: Done injects a manifest array into the 
type. Yield defines a delayed array in terms of its shape and a func- 
tion which maps indices to elements. The third constructor, Step, 
encodes a special case of the more general Yield that represents 
the application of an index and/or value space transformation to the 
argument array. The type Fun (sh -> e) is that of a term rep- 
resenting a scalar function from shape to element type. The type 
Idx (Array sh e) is that of a de Bruijn index representing an 
array valued variable. Representing the argument array in this way 
means that both Step and Yield are non-recursive in Acc terms, 
and so they can always be expressed as scalar functions and em- 
bedded into consumers in the second phase of fusion. 

We represent all array functions as constructors of the type 
DelayedAcc. Producer/producer fusion is achieved by tree con- 



traction on the AST, merging sequences of producers into a single 
one. All array producing functions, such as map and backpermute, 
are expressed in terms of smart constructors for the DelayedAcc 
type. The smart constructors manage the integration with succes- 
sive producers, as shown in the following definition of mapD, the 
delayed version of the map function: 

mapD : : (Shape sh, Elt a, Elt b) 

=> Fun (a -> b) 

-> DelayedAcc (Array sh a) 

-> DelayedAcc (Array sh b) 
mapD f (Step sh p g v) = Step sh p (f . g) v 
mapD f (Yield sh g) = Yield sh (f . g) 

The function composition operator ( . ) is overloaded here to work 
on scalar function terms. With this definition we now have the well 
known fusion rule that reduces mapD f . mapD g sequences to 
mapD (f . g) . Similarly, the definition of delayed backpermute 
means that backpermuteD sh p (backpermuteD _ q arr) re- 
duces to backpermute sh (q . p) arr: 

backpermuteD 

:: (Shape sh, Shape sh', Elt e) 

=> Exp sh' 

-> Fun (sh' -> sh) 

-> DelayedAcc (Array sh e) 

-> DelayedAcc (Array sh' e) 
backpermuteD sh' p acc = case acc of 

Step ix f a -> Step sh' (ix . p) f a 

Yield f -> Yield env sh' (f . p) 

Done env a -> Step sh' p identity (toldx a) 

Of course, combinations of maps with backpermutes also reduce to 
a single producer. 

As desired, this approach also works on producers which 
take their input from multiple arrays. This is in contrast to 
foldr /build [15|, which can fuse one of the input arguments, 
but not both. The definition of zipWithD considers all possible 
combinations of constructors (only some of which we list here) 
and can therefore fuse producers of both arguments: 

zipWithD :: (Shape sh, Elt a, Elt b, Elt c) 
=> Fun (a -> b -> c) 
-> DelayedAcc (Array sh a) 
-> DelayedAcc (Array sh b) 
-> DelayedAcc (Array sh c) 
zipWithD f (Yield shl gl) (Yield sh2 g2) 
= Yield (shl 'intersect' sh2) 

(\sh -> f (gl sh) (g2 sh)) 
zipWithD f (Yield shl gl) (Step sh2 ix2 g2 a2) 
= Yield ... 

In this manner, sequences of producers fuse into a single producer 
term; then, we turn them back into a manifest array using the 
function compute. It inspects the argument terms of the delayed 
array to identify special cases, such as maps or backpermutes, as 
shown in the following snippet of pseudo-code: 

compute : : DelayedAcc a -> Acc a 
compute (Done a) = a 

compute (Yield sh f) = Generate sh f 

compute (Step sh p f v) 

I sh == shape a, isld p, isld f 

= a 

I sh == shape a, isld p = Map f a 
I isld f = Backpermute sh p a 

I otherwise = Transform sh p f a 

where a = Avar v 



Since we operate directly on the AST of the program, we can 
inspect function arguments and specialise the code accordingly. 
For example, isld : : Fun (a->b) -> Bool checks whether a 
function term corresponds to the term Xx.x. 

4.3 Consumer/Producer Fusion 

Now that we have the story for producer/producer fusion, we dis- 
cuss how to deal with consumers. We pass producers encoded in 
the DelayedAcc representation as arguments to consumers, so that 
the consumers can compute the elements they need on-the-fly. Con- 
sumers themselves have no DelayedAcc representation, however. 

Consumers such as stencil, access elements of their argument 
array multiple times. These consumers are implemented carefully 
not to duplicate work. Indeed, even when the argument of such a 
consumer is a manifest array, the consumer should ensure that it 
caches already fetched elements, as CPUs impose a high perfor- 
mance penalty for repeated memory loads. Some consumers can be 
implemented more efficiently when given a producer expressed in 
terms of a function from a multi-dimensional array index to an ele- 
ment value. Other consumers prefer functions that map the flat lin- 
ear index of the underling array to its value. Our consumer-friendly 
representation of delayed arrays therefore contains both versions: 

data Embedded sh e 
= (Shape sh, Elt e) 

=> DelayedArray { extent : : Exp sh 

, index : : Fun (sh -> e) 

, linearlndex : : Fun (Int -> e) } 

embedAcc : : (Shape sh, Elt e) 

=> Acc (Array sh e) -> Embedded sh e 
embedAcc (Generate sh f) 

= DelayedArray sh f (f . (f romlndex sh) ) 
embedAcc (Map f (AVar v) ) 

= DelayedArray (shape v) (indexArray v) 

(linearlndexArray v) 

embedAcc . . . 

The function embedAcc intelligently injects each producer into 
the Embedded type by inspection of the constructor, as shown 
in the code snippet above. In theory, compute and embedAcc 
could be combined to go directly from the delayed representation 
which is convenient for producer/producer fusion to the one for 
consumer/producer fusion. However, these two steps happen at 
different phases in the compilation, so we want to limit the visibility 
of each delayed representation to the current phase. 

Producers are finally embedded into consumers during code 
generation. During code generation, the code for the embedded pro- 
ducers is plugged into the consumer template. The codegenAcc 
function inspects the consumer and generates code for the argu- 
ments of the consumer. It then passes these CUDA code snippets 
to a function specialised on generating code for this particular con- 
sumer (mkFold in the example below), which combines these snip- 
pets with the actual CUDA consumer code: 

codegenAcc : : DeviceProperties -> OpenAcc arrs 

-> Gamma aenv -> CUTranslSkel arrs 
codegenAcc dev (OpenAcc (Fold f z a) ) aenv 

= mkFold dev aenv (codeGenFun f) (codeGenExp z) 
(codegenDelayedAcc $ embedAcc a) 
codegenAcc dev (OpenAcc (Scanl f z a)) aenv 

As a result, the code producing each element is integrated directly 
into the consumer, and no intermediate array needs to be created. 



4.4 Exploiting all opportunities for fusion 

As mentioned previously, we need to be careful about fusing shared 
array computations, to avoid duplicating work. However, scalar 
Accelerate computations that manipulate array shapes, as opposed 
to the bulk array data, can lead to terms that employ sharing, but 
can never duplicate work. Such terms are common in Accelerate 
code, and it is important to that they do not inhibit fusion. 

Consider the following example that first reverses a vector with 
backpermute and then maps a function f over the result. Being 
a sequence of two producer operations, we would hope these are 
fused into a single operation: 

reverseMap f a 
= map f 

$ backpermute (shape a) (\i->length a-i-1) a 

Unfortunately, sharing recovery, using the algorithm from Sec- 
tion [3] causes a problem. The variable a is used three times in the 
arguments to backpermute; hence, sharing recovery will introduce 
a let binding at the lowest common meet point for all uses of the 
array. This places it between the map and backpermute functions: 

reverseMap f a 
= map f 
$ let v = a 

in backpermute (shape v) (\i->length v-i-1) v 

This binding, although trivial, prevents fusion of the two producers, 
and it does so unnecessarily. The argument array is used three 
times: twice to access shape information, but only once to access 
the array data — in the final argument to backpermute. 

Fortunately, there is a simple work around. Recall that our 
delayed array constructors Step and Yield carry the shape of 
the arrays they represent. Hence, we can eliminate all uses of the 
array that only access shape information, leaving us with a single 
reference to the array's payload. That single reference enables us to 
remove the let binding and to re-enable fusion. 

Similarly, we may float let bindings of manifest data out (across 
producer chains). This helps to expose further opportunities for 
producer/producer fusion. For example, we allow the binding of 
xs to float above the map so the two producers can be fused: 

map g $ let xs = use (Array . . . ) 
in zipWith f xs xs 

While floating let bindings opens up the potential for further 
optimisations, we are careful to not increase the lifetime of bound 
variables, as this would increase live memory usage. 

5. Benchmarks 

Benchmarks were conducted on a single Tesla T10 processor (com- 
pute capability 1.3, 30 multiprocessors = 240 cores at 1.3GHz, 
4GB RAM) backed by two quad-core Xenon E5405 CPUs (64-bit, 
2GHz, 8GB RAM) running GNU/Linux (Ubuntu 12.04 LTS). The 
reported GPU runtimes are averages of 100 runs. 

5.1 Execution Overheads 

Runtime program optimisation, code generation, kernel loading, 
data transfer, and so on can contribute significant overheads to short 
lived GPU computations. Accelerate mitigates these overheads via 
caching and memoisation. For example, the first time a particu- 
lar expression is executed it is compiled to CUDA code, which is 
reused for subsequent invocations. In the remainder of this section 
we factor out the cost of runtime code generation, and the associ- 
ated caching, by reporting just the runtimes of the GPU kernels — 
for our system as well as the others we compare against. 



5.2 Dot product 

Dot product uses the code from Section [2~T| Without fusion the in- 
termediate array produced by zipWith is created in GPU memory 
before being read back by fold. This is a simple memory-bound 
benchmark; hence, fusion roughly doubles the performance. 

The Data. Vector baseline is sequential code produced via 
stream fusion 1121 . running on the host CPU. The Repa version 
runs in parallel on all eight cores of the host CPU, using the fusion 
method reported in 1 17 1. The NDP2GPU version (4) compiles NESL 
code [ 6 1 to CUDA. The performance of this version suffers because 
the NDP2GPU compiler uses the legacy NESL compiler for the front- 
end, which introduces redundant administrative operations that are 
not strictly needed when evaluating a dot product. 

The Accelerate version is still slightly slower than CUBLAS. 
The fused code embeds a function of type (sh -> a) into the 
reduction. The extra arithmetic for converting the multidimentional 
sh index to a linear index is significant for this simple benchmark. 

5.3 Black-Scholes options pricing 

Black-Scholes uses the code from Figure [T] The source contains 
several let-bound variables that are used multiple times, and with- 
out sharing recovery the corresponding values are recomputed for 
each occurrence. The Accelerate version is faster than the reference 
CUDA version because the latter contains a common subexpression 
that is not eliminated by the CUDA compiler. The CUDA version 
is part of the official NVIDIA CUDA distribution. The common 
subexpression performs a single multiplication. 

5.4 N-Body gravitational simulation 

The n-body example simulates Newtonian gravitational forces on 
a set of massive bodies in 3D space, using the naive 0(n 2 ) algo- 
rithm. In a data-parallel setting, the natural implementation first 
computes the forces between every pair of bodies, before reduc- 
ing the components applied to each body using a segmented sum. 
Without fusion this approach requires 0(n 2 ) space for the interme- 
diate array of forces, which exhausts the memory of our device for 
more than about 5k bodies. With fusion, the reduction consumes 
each force value on-the-fly, so that the program only needs O(n) 
space to store the final force values. 

Even with fusion the hand- written CUDA version is over 10 x 
faster, as it uses on-chip shared memory to reduce the memory 
bandwidth requirements of the program. The shared memory is 
essentially a software managed cache, and making automatic use 
of it remains an open research problem 1211 . 

5.5 Mandelbrot fractal 

The Mandelbrot set is generated by sampling values c in the com- 
plex plane, and determining whether under iteration of the complex 
quadratic polynomial z n +i = 4 + c ma t \ z n\ remains bounded 
however large n gets. As recursion in Accelerate always proceeds 
through the host language, we define each step of the iteration as a 
collective operation and unfold the loop a fixed number of times. 

Table [2] shows that without fusion performance is poor because 
storing each step of the iteration saturates the memory bus. The 
CUDA version is about 70% faster because it includes a custom 
software thread block scheduler to manage the unbalanced work- 
load inherent to this benchmark. 

5.6 Fluid Flow 

Fluid flow implements Jos Stam's stable fluid algorithm [32|, a fast 
approximate algorithm intended for animation and games, rather 
than accurate engineering simulation. The core of the algorithm 
is a finite time step simulation on a grid, implemented as a matrix 
relaxation involving the discrete Laplace operator (V 2 ). Relaxation 







Contender 


Accelerate 


Accelerate 


Accelerate 


Benchmark 


Input Size 




(ms) 


full (ms) 


no fusion (ms) 


no sharing (ms) 


Black Scholes 


20M 


6.70 


(CUDA) 


6.19 


(92%) 


(not needed) 


116 (1731%) 


Canny 


16M 


50.6 


(OpenCV) 


78.4 


(155%) 


(not needed) 


82.7 (164%) 


Dot Product 


20M 


1.88 


(CUBLAS) 


2.35 


(125%) 


3.90 (207%) 


(not needed) 


Fluid Flow 


2M 


5461 


(Repa -N7) 


107 


(1.96%) 


(not needed) 


119 (2.18%) 


Mandelbrot (limit) 


2M 


14.0 


(CUDA) 


24.0 


(171%) 


245 (1750%) 


245 (1750%) 


N-Body 


32k 


54.4 


(CUDA) 


607 


(1116%) 


(out of memory) 


(out of memory) 


Radix sort 


4M 


780 


(Nikola) 


442 


(56%) 


657 (84%) 


657 (84%) 


SMVM (protein) 


4M 


0.641 


(CUSP) 


0.637 


(99%) 


32.8 (5115%) 


(not needed) 



Table 2. Benchmark Summary 



is performed by stencil convolution using the standard four-point 
Laplace kernel. The program is very memory intensive, performing 
approximately 160 convolutions of the grid per time step. The Repa 
version running on the host CPUs is described in 1201 ; it suffers 
from a lack of memory bandwidth compared with the GPU version. 

5.7 Canny edge detection 

Edge detection applies the Canny algorithm |7| to square images 
of various sizes. The algorithm consists of seven phases, the first 
six are naturally data parallel and performed on the GPU. The last 
phase uses a recursive algorithm to "connect" pixels that make up 
the output lines. In our implementation this phase is performed 
sequentially on a host CPU, which accounts for the non-linear 
slowdown visible with smaller images. We also show the runtime 
for just the first six data parallel phases. 

The data parallel phases are slightly slower than the baseline 
OpenCV version. This is as our implementation of stencil convolu- 
tion in Accelerate checks whether each access to the source array 
is out of bounds, and must have boundary conditions applied. To 
address this shortcoming we intend to separate computation of the 
border region which requires boundary checks, from the main in- 
ternal region which does not 1 19 1, but we leave this to future work. 

5.8 Radix sort 

Radix sort implements the algorithm described in Blelloch (5) to 
sort an array of signed 32-bit integers. We compare our implemen- 
tation against a Nikola 1221 version^ 

The Accelerate version is faster then Nikola because Nikola is 
limited to single kernel programs and must transfer intermediate 
results back to the host. Hand written CUDA implementations 
such as in the Thrust |29 | library make use of on-chip shared 
memory and are approximately 10 x faster. As mentioned earlier, 
automatically making use of GPU shared memory remains an open 
research problem 1211 . 

5.9 Sparse-matrix vector multiplication (SMVM) 

SMVM multiplies a sparse matrix in compressed row format 
(CSR) |9| with a dense vector. Table [3] compares Accelerate to 
the CUSP library |3 ], which is a special purpose library for sparse 
matrix operations. For test data we use a 14 matrix corpus derived 
from a variety of application domains (33). 

Compared to our previous work 1 8 1 the fusion transformation 
converts the program to a single segmented reduction. The corre- 
sponding reduction in memory bandwidth puts Accelerate on par 
with CUSP for several test matrices. In a balanced machine SMVM 
should be limited by memory throughput, and a dense matrix in 
sparse format should provide an upper bound on performance. 

2 We repeat figures of 1221 as Nikola no longer compiles with recent GHC 
versions. The figures from [22] were obtained using the same GPU as ours. 



Name 


Non-zeros 
(nnz/row) 


CUSP 


Accelerate 


Accelerate 
no fusion 


Dense 


4M (2K) 


14.48 


14.62 


3.41 


Protein 


4.3M(119) 


13.55 


13.65 


0.26 


FEM/Spheres 


6M (72) 


12.63 


9.03 


4.70 


FEM/Cantilever 


4M (65) 


11.98 


7.96 


4.41 


Wind Tunnel 


11. 6M (53) 


11.98 


7.33 


4.62 


FEM/Harbour 


2.37M (50) 


9.42 


6.14 


0.13 


QCD 


1.9M (39) 


7.79 


4.66 


0.13 


FEM/Ship 


3.98 (28) 


12.28 


6.60 


4.47 


Economics 


1.27M (6) 


4.59 


0.90 


1.06 


Epidemiology 


1.27M (4) 


6.42 


0.59 


0.91 


FEM/ Accelerator 


2.62M (22) 


5.41 


3.08 


2.92 


Circuit 


959k (6) 


3.56 


0.82 


1.08 


Webbase 


3.1M(3) 


2.11 


0.47 


0.74 


LP 


11. 3M (2825) 


5.22 


5.04 


2.41 



Table 3. Overview of sparse matrices tested and results of the 
benchmark. Measurements are in GFLOPS/s (higher is better). 

However, with matrices such as FEM/Spheres which contain 
only a few non-zeros per row (< 2 x warp size = 64), Accelerate 
is slightly slower than CUSP. We conjecture that this is due to the 
way our skeleton code vector read of each matrix row is coalesced 
and aligned to the warp boundary to maximise global memory 
throughput, but is then not able to amortise this extra startup cost 
over the row length. 

Matrices with large vectors and few non-zeros per row (e.g., 
Epidemiology), exhibit low flop/byte ratio and poorly suit the CSR 
format, with all implementations performing well below peak. 

6. Related work 

Repa 1 17 1 is a Haskell library for parallel array programming on 
shared-memory SMP machines. Repa uses the delayed/manifest 
representation split on which our DelayedAcc type is based, 
though the idea of representing arrays as functions is folklore. With 
Repa the conversion between array representations is done manu- 
ally and can cause shared expressions to be recomputed rather than 
stored. Such recomputation can improve runtime performance de- 
pending on the algorithm. In Accelerate the conversion is automatic 
and conservative, so that shared expressions are never recomputed. 

Vertigo (13), Nikola |22) and Obsidian (10) are EDSLs in 
Haskell and were mentioned in Section [TJ Vertigo is a first-order 
language for writing shaders, and does not provide higher-order 
combinators such as map and fold. Nikola uses an instance of 
Gill's approach 1 14 1 to sharing recovery, is limited to single GPU 
kernel programs, and performs no fusion. 
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Obsidian 1101 is a lower level language where more details of 
the GPU hardware are exposed to the programmer. Recent versions 
of Obsidian [11] implement Repa-style delayed pull arrays as well 
as push arrays. Whereas a pull array represents a general producer, 
a push array represents a general consumer. Push arrays allow the 
intermediate program to be written in continuation passing style 
(CPS), and helps to compile (and fuse) append-like operations. 

Baracuda 1 18 1 is another Haskell EDSL that produces CUDA 
GPU kernels, though is intended to be used offline, with the kernels 
being called directly from C++. The paper 1 18] mentions a fusion 
system that appears to be based on pull arrays, though the mecha- 
nism is not discussed in detail. Barracuda steps around the sharing 
problem by requiring let-bindings to be written using the AST node 
constructor, rather than using Haskell's native let-expressions. 

Delite/LMS |28| is a parallelisation framework for DSLs in 
Scala that uses library-based multi-pass staging to specify complex 
optimisations in a modular manner. Delite supports loop fusion for 
DSLs targeting GPUs using rewrite rules on a graph-based IR. 

NDP2GPU |U compiles NESL code down to CUDA. As the 
source language is not embedded there is no need for sharing 
recovery. NDP2GPU performs map/map fusion but cannot fuse 
maps into reduction combinators. 

Sato and Iwasaki 1 30 1 describe a C++ library for GPGPU pro- 
gramming that includes a fusion mechanism based on list homo- 
morphisms |25|. The fusion transformation itself is implemented 
as a source to source translation. SkeTo 1241 is a C++ library that 
provides parallel skeletons for CPUs. SkeTo's use of C++ templates 
provides a fusion system similar to delayed arrays, which could be 
equivalently implemented using CUDA templates. The authors of 
SkeTo note that the lack of type inference in C++ leads them to 
write their array code as nested expressions — to avoid intermedi- 
ate variable bindings and their required type annotations. 
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